Abstract

We study the typical learning properties of the recently introduced Soft 
Margin Classifiers (SMCs), learning realizable and unrealizable tasks, with 
the tools of Statistical Mechanics. We derive analytically the behaviour of 
the learning curves in the regime of very large training sets. We obtain exponential
and power laws for the decay of the generalization error towards
the asymptotic value, depending on the task and on general characteristics 
of the distribution of stabilities of the patterns to be learned. The optimal 
learning curves of the SMCs, which give the minimal generalization error, are 
obtained by tuning the coefficient controlling the trade-off between the error 
and the regularization terms in the cost function. If the task is realizable 
by the SMC, the optimal performance is better than that of a hard margin 
Support Vector Machine and is very close to that of a Bayesian classifier. 
I. INTRODUCTION



Neural networks are models of learning systems composed of interconnected units that, 
besides their biological relevance, have been shown to be very useful for classification tasks. 
The weights of the connections are adjusted through a process called learning using a set of 
M examples. It is assumed that these are labeled following an underlying rule, usually called 
teacher. The purpose of learning is not only to classify correctly the examples of the training 
set, but also to generalize correctly on new inputs. To this aim, the network has to infer the 
teacher's rule. The quality of this inference is measured through the generalization error $\epsilon_g$,
which is the probability of misclassification of a new, randomly selected, input pattern. As
$\epsilon_g$ is not a quantity available to the training process, learning is usually performed through
the minimization of a function of the training patterns. The tools of Statistical Mechanics
allow one to study the properties of such learning systems, providing a deep understanding of
their typical behaviour [1-5]. In particular, it has been shown that the minimization of the
training error, that is, the fraction of training patterns misclassified by the network, does
not necessarily provide the best generalizer [6-8]. This is why other cost functions, based
on geometrical properties like the distance of the patterns to the discriminating surface, or
on probabilistic error measures like the likelihood, are used for training.

The simplest instance of a neural network, the perceptron, is a single binary unit whose
output is the sign of the weighted sum of its inputs. It can only perform linear separations 
of the patterns. If the classification task requires more complex discriminating surfaces, 
these may be implemented using feedforward networks with a layer of hidden units whose 
number is a priori unknown. The cost functions used to tackle this problem usually have 
several minima, and determining the lowest one is one of the main difficulties of learning 
with multilayer neural networks. This is also a problem for the theoretical analysis, as the 
typical properties of such networks depend crucially on the structure of the minima in the 
weights' space. 

Recently, a new learning scheme has been proposed, which strives to get rid of the 
problem raised by the multiple minima. The obtained classifiers are called Support Vector 
Machines (SVMs) [9,10]. Instead of directly looking for a complicated discriminating surface
in input space, the patterns are first mapped to a high dimensional feature space, where the
rule to be learned is (hopefully) linearly separable. If this is the case, a simple perceptron
can be trained to find the separation in feature space. Denoting the weights by $\mathbf{w} \in \mathbb{R}^N$,
the perceptron's output to an input $\mathbf{x} \in \mathbb{R}^N$ is given by $\sigma = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x} + b)$, where $b$ is a bias
and the dot represents the inner product in $\mathbb{R}^N$. Thus, the patterns belonging to different
classes are separated by a hyperplane orthogonal to $\mathbf{w}$ at distance $|b|/\|\mathbf{w}\|$ from the origin,
with $\|\mathbf{w}\| = \sqrt{\mathbf{w}\cdot\mathbf{w}}$. The SVM's solution is the Maximal Stability Perceptron (MSP) [11] in
feature space, also called maximal margin hyperplane. This is the hyperplane at maximal
distance $\kappa_{max}$ from the closest patterns in the training set. Two different formulations of
this problem in terms of cost functions have been proposed in the literature. In the first
one [11], the cost function counts not only the number of misclassified patterns, but also the
number of correctly classified ones that lie at a distance smaller than $\kappa$ from the separating
hyperplane:
where $\Theta$ is the Heaviside function, and

$$h_\mu = \tau_\mu(\mathbf{w}\cdot\mathbf{x}_\mu + b), \qquad (2)$$

is called the aligned field of the training pattern $\mathbf{x}_\mu$, $\tau_\mu \in \{-1, 1\}$ being its class. If the $M$
$N$-dimensional patterns are correctly classified, the aligned fields are all positive. The
SVM solution has $\mathbf{w}$ and $b$ corresponding to $\kappa_{max}$, the largest possible value of $\kappa$ such
that $E_{MSP}(\kappa_{max}) = 0$. If the training set is not linearly separable, $\kappa_{max}$ becomes negative.
Notice that there are no constraints on the norm of $\mathbf{w}$, which can be freely chosen.

If the norm of the weight vector is chosen so that the aligned field of the closest pattern 
be 1, this leads to an equivalent formulation of the problem [9,10], in which the function to 
be minimized is: 

$$E_{SVM} = \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w}, \qquad (3)$$

subject to the conditions

$$h_\mu \ge 1, \quad \mu = 1, \ldots, M. \qquad (4)$$

Clearly, the constraints (4) can only be satisfied if it is possible to classify correctly all the
examples. In that case, there are no training patterns in a strip of width $1/\|\mathbf{w}\|$ on both
sides of the hyperplane, meaning that in the error-free regime $1/\|\mathbf{w}\| = \kappa_{max}$. An interesting
property of the SVM solution is that the weight vector and the bias can be written as a
linear combination of a sub-set of training patterns, the Support Vectors, having $h_\mu = 1$.

The minimization of (1) with $\kappa = \kappa_{max}$ is equivalent to that of (3) with condition (4)
only if the training set is linearly separable. If errors cannot be avoided, the equivalence
breaks down, as on the one hand (1) has either negative $\kappa_{max}$, or several minima if $\kappa_{max} > 0$
is imposed, and on the other hand the constraints (4) cannot be satisfied. This is why
the second formulation has been generalized [10] through the introduction of a new set of
variables $\zeta_\mu \ge 0$, called slacks, which are a measure of the "amount of violation" of the
constraints. An increasing function of these is included in the cost function (3) and the hard
margin conditions (4) are modified to allow some patterns to be closer to the hyperplane
than $1/\|\mathbf{w}\|$. The new problem amounts to minimizing:

$$E_{C,k} = \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_{\mu=1}^{M}\zeta_\mu^k, \qquad (5)$$

subject to the following conditions for $\mu = 1, \ldots, M$:

$$h_\mu \ge 1 - \zeta_\mu, \qquad (6a)$$
$$\zeta_\mu \ge 0. \qquad (6b)$$

The coefficient $C$ in (5) is a hyperparameter that controls the trade-off between the
error term, defined by the slacks, and the regularization term, proportional to the squared
weights. As will be shown in section IV, it may be selected to optimize the generalization
performance. The exponent $k$ in (5) modulates the relative cost of errors, depending on
their distance to the hyperplane. Patterns in a strip of width $1/\|\mathbf{w}\|$ at each side of the
hyperplane, whether correctly or incorrectly classified, as well as those incorrectly classified
outside of this strip, have $\zeta_\mu > 0$. $1/\|\mathbf{w}\|$ is called soft margin, and the resulting classifier
soft margin SVM or soft margin classifier (SMC).
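As a minimal sketch of the cost (5) under constraints (6a) and (6b): for a fixed weight vector, the smallest admissible slack of each pattern is $\max(0, 1 - h_\mu)$, so the cost can be evaluated directly from the aligned fields. The function names and the toy data below are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
def aligned_fields(w, X, tau, b=0.0):
    """Aligned fields h_mu = tau_mu * (w . x_mu + b), as in eq. (2)."""
    return [t * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, t in zip(X, tau)]


def soft_margin_cost(w, X, tau, C, k=1, b=0.0):
    """Cost (5) evaluated at the smallest slacks compatible with (6a) and (6b)."""
    h = aligned_fields(w, X, tau, b)
    # Smallest zeta_mu satisfying h_mu >= 1 - zeta_mu and zeta_mu >= 0
    zeta = [max(0.0, 1.0 - h_mu) for h_mu in h]
    cost = 0.5 * sum(wi * wi for wi in w) + C * sum(z ** k for z in zeta)
    return cost, zeta


# Hypothetical toy data: three patterns of class +1 lying on a line
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
tau = [1, 1, 1]
cost, zeta = soft_margin_cost([1.0, 0.0], X, tau, C=10.0, k=1)
print(zeta, cost)
```

In this toy case the first pattern lies outside the strip (zero slack), the second lies inside the strip but is correctly classified, and the third is misclassified; the last two both carry positive slacks, as described above.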

As the cost (5) is a quadratic function for $k = 1$ and $k = 2$, and the domain of minimization
defined by (6a) and (6b) is convex, the minimum is unique [12]. This remarkable
property makes the new formulation attractive for applications, as it gets rid of the
multiple minima appearing in other learning schemes. As in the hard margin formulation,
the solution $\{\mathbf{w}, b\}$ can be expressed as a linear combination of the support vectors, which
now include the patterns with positive slacks. The corresponding coefficients may be obtained
by solving the dual problem (see for example [13]) which, for $k = 1$ or $k = 2$, has a
particularly simple expression [10]. Several efficient methods are known for solving this kind
of problem, and this is one of the reasons why these classifiers are so widely used lately.

In this paper we study the typical properties of the SMCs obtained by solving equation 
(5) subject to the conditions (6a) and (6b), with the methods of Statistical Mechanics, 
using the replica approach. It has been shown [14,15] that the statistical properties of 
SVMs in high dimensional feature spaces [16] can be well approximated by considering a 
simple perceptron learning anisotropically distributed patterns. The amount of anisotropy 
depends on the normalization of the mapping from the input to the feature space. In this 
paper we restrict ourselves to an isotropic pattern distribution, which corresponds to a non-normalized
mapping. 

The learning properties of a perceptron learning an isotropic input pattern distribution 
have been extensively studied [17], mainly for linearly separable, i.e. realizable, tasks. In this 
case the hypothesis of replica symmetry is generally correct, allowing for a full analytical 
statistical mechanics calculation. In particular, the generalization error
$\epsilon_g$ in the limit of very large $\alpha = M/N$ has a universal power law decay $\epsilon_g \sim \alpha^{-\nu}$ with
$\nu = 1$. Its prefactor characterizes the convergence to perfect learning of different
learning algorithms. If the rule to be inferred cannot be generalized without errors, the
task is called unrealizable. In this case the replica symmetric solution, although generally
unstable, is believed to provide a good approximation of some learning properties. However,
in the case of a linearly separable rule learned with noisy training patterns, which is thus
unrealizable, the replica symmetric approximation gives an exponent $\nu = 1/2$ [2] whereas
one step of replica symmetry breaking shows [18] that this exponent is modified to $\nu = 2/3$.
As this is but an approximation to the full replica symmetry breaking scheme [19] at zero
temperature, it is not clear whether this exponent is correct. The same exponent has been
found in the case of a quadratic hard margin SVM learning a linearly separable task, that
is, a rule simpler than those implementable with the student's architecture [16]. Another
case of interest is that of inconsistent learning [6], which refers to realizable tasks learned
with algorithms that do not strive to minimize the number of training errors. In this case,
the exponent within the replica symmetric approximation was found to be $\nu = 1/2$ [6].

As the soft margin problem has a unique minimum for $k = 1$ and $k = 2$, even if the
task is unrealizable, the replica symmetry hypothesis should be always correct, providing a 
framework for the study of complex classification tasks even when the mismatch between 
the student and the teacher hinders error-free learning. 

In this paper we present the statistical properties of SMCs learning several kinds of 
realizable and unrealizable rules. The model and the statistical mechanics approach are 
presented in section II. The theoretical properties of SMCs with exponents $k = 1$ and
$k = 2$ in the cost function (5) are obtained as a function of the training set size $\alpha = M/N$
in the thermodynamic limit $N, M \to \infty$. Several teacher rules are considered in section
III. One of our most striking results is that the generalization error for large $\alpha$ exhibits a
very rich variety of asymptotic behaviours, depending on the type of rule to be inferred.
In particular, even if the task is realizable, the soft margin algorithm is inconsistent unless
$C \to \infty$. For finite $C$, we find that the fraction of training errors at finite $\alpha$ is finite, and the
generalization error vanishes asymptotically with $\alpha$ following a $\nu = 2/3$ power law. In the
unrealizable tasks considered, $\epsilon_g$ converges to an asymptotic finite value either exponentially
or with a power law with $\nu = 1/2$. The usual exponent $\nu = 1$ only arises for error-free
learning of a realizable task. In section IV we derive the best generalization performances
of SMCs through the determination of the value $C_{opt}(\alpha)$ that minimizes the generalization
error. Finally we present a summary of our results in section V, together with some open
questions. Most details of the proofs are left to the Appendix.



II. STATISTICAL MECHANICS APPROACH 

We consider a student perceptron of weight vector $\mathbf{w} = (w_1, \ldots, w_N)$, without threshold.
That is, we set $b = 0$ in (6a). Given any $N$-dimensional input vector $\mathbf{x}$, the classifier's
output is $\sigma = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x})$: all the points lying on the same side of a hyperplane orthogonal
to $\mathbf{w}$ containing the origin are given the same class. We assume that the perceptron learns
the classification with the soft margin algorithm, using a set $\mathcal{L}_M = \{(\mathbf{x}_\mu, \tau_\mu)\}_{\mu=1,\ldots,M}$ of $M$
examples or training patterns. These consist of input vectors drawn from an isotropic 
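Gaussian distribution whose components have standard deviation $1/\sqrt{N}$,

$$P(\mathbf{x}) = \frac{\exp(-N\,\mathbf{x}\cdot\mathbf{x}/2)}{(2\pi/N)^{N/2}}, \qquad (7)$$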

and labels $\tau_\mu \in \{-1, 1\}$ that represent the corresponding classes. The classification tasks
considered in this paper are given by the following teacher's rule: 

$$\tau = \mathrm{sign}(\mathcal{P}(\mathbf{w}_0\cdot\mathbf{x})), \qquad (8)$$

where $\mathbf{w}_0$ is referred to as the teacher's vector hereafter, and $\mathcal{P}(z)$ is a polynomial of $z$. Each
of its zeros $z_i$ [20] defines a discriminating hyperplane at a distance $|z_i|/\|\mathbf{w}_0\|$ from the origin.
Rules of the kind (8) partition the input space in as many different regions as the number of
zeros of the polynomial plus one, separated by parallel hyperplanes normal to the teacher's
vector $\mathbf{w}_0$. Patterns in successive regions belong alternately to class $+1$ or $-1$. As only
the zeros of the function $\mathcal{P}(z)$ matter, there is no loss of generality in our assumption that
$\mathcal{P}(z)$ is a polynomial. We assume $\|\mathbf{w}_0\| = \sqrt{N}$, which is equivalent to imposing the unit of
distance. Notice that the only rule realizable for the student perceptron considered in this
paper is that of the linear teacher $\mathcal{P}(z) = z$.
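The construction of a training set for a teacher of type (8) can be sketched as follows. This is an illustration under stated assumptions: the input components are taken with variance $1/N$, so that the teacher's field $z = \mathbf{w}_0\cdot\mathbf{x}$ has unit variance when $\|\mathbf{w}_0\| = \sqrt{N}$; the polynomial is assumed monic and specified by its real roots; the function names are hypothetical.

```python
import math
import random


def teacher_label(z, roots):
    """tau = sign(P(z)) for the monic polynomial P with the given real roots."""
    p = 1.0
    for r in roots:
        p *= (z - r)
    return 1 if p >= 0 else -1


def make_training_set(N, M, roots, seed=0):
    """Draw M isotropic Gaussian patterns in N dimensions and label them with rule (8)."""
    rng = random.Random(seed)
    w0 = [rng.gauss(0.0, 1.0) for _ in range(N)]
    scale = math.sqrt(N) / math.sqrt(sum(c * c for c in w0))
    w0 = [c * scale for c in w0]                      # enforce ||w0|| = sqrt(N)
    X, tau, fields = [], [], []
    for _ in range(M):
        x = [rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)]  # variance 1/N
        z = sum(a * b for a, b in zip(w0, x))         # teacher field, unit variance
        X.append(x)
        tau.append(teacher_label(z, roots))
        fields.append(z)
    return w0, X, tau, fields


# Linear teacher P(z) = z: a single root at the origin
w0, X, tau, fields = make_training_set(N=20, M=2000, roots=[0.0])
```

Passing `roots=[0.3]` would give a linear teacher shifted by 0.3, and `roots=[-d, 0.0, d]` a third order polynomial with three parallel discriminating hyperplanes.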

In the following we study the properties of the solution to the soft margin problem 
using the by now standard tools of Statistical Mechanics [1,2]. That is, we assume that the 
ensemble of classifiers follows a Gibbs distribution defined by the energy function (5), at a 
fictitious temperature $1/\beta$, and we take the zero temperature limit. The constraints (6a)
and (6b) play the role of infinite potential walls. Notice that the phase space in the present
case is $\mathbb{R}^{N+M}$, as not only the weights $\mathbf{w}$ but also the slacks $\{\zeta_\mu\}_{\mu=1,\ldots,M}$ have
to be learned. The partition function is:

$$Z_{C,k}(\beta; \mathcal{L}_M, \mathcal{P}) = \int \exp\left(-\beta E_{C,k}(\mathbf{w}, \{\zeta_\mu\})\right) \prod_{\mu=1}^{M} \Theta\left(\tau_\mu\,\mathbf{w}\cdot\mathbf{x}_\mu - (1 - \zeta_\mu)\right)\Theta(\zeta_\mu)\, d\mathbf{w}\, d\zeta_\mu. \qquad (9)$$

The inverse temperature $\beta$ has obviously no physical meaning whatsoever; it is only
introduced in order to study the properties of the SMC which, being the single minimum of
the energy function, is selected in the limit $\beta \to \infty$. We assume that the number of training
examples scales with the input space dimension, $M = \alpha N$, and take the thermodynamic
limit $N \to \infty$, $M \to \infty$ with $\alpha = M/N$ constant. The free energy per input space dimension
averaged over all the possible training sets of $M$ patterns, $f_{C,k}(\beta; \mathcal{P})$, is calculated with the
replica method, that uses the identity



where the overline represents the average over the pattern distribution (7), with labels given 
by (8). $Z_{C,k}^n$ is the partition function of $n$ independent replicas of the problem, that become
coupled after taking the average. The typical properties of the classifier are obtained by
taking the limit $\beta \to \infty$. The free energy (10) turns out to be a function of the following
order parameters: 



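$$Q_a = \frac{\langle \mathbf{w}_a\cdot\mathbf{w}_a \rangle}{N}, \qquad (11a)$$
$$q_{ab} = \frac{\langle \mathbf{w}_a\cdot\mathbf{w}_b \rangle}{N}, \qquad (11b)$$
$$R_a = \frac{\langle \mathbf{w}_a\cdot\mathbf{w}_0 \rangle}{N}, \qquad (11c)$$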

where the brackets represent the phase space average and $a$ and $b$ are replica indices. The
norm of the perceptron's weight vector, $Q_a$, is one of the order parameters because in the
soft margin problem the weights are not normalized as usual; $q_{ab}$ is the overlap between
two different weight vectors at temperature $\beta^{-1}$, and $R_a$ is the overlap of the perceptron's
solution and the teacher's vector.

As for k = 1 and k = 2 the energy in (5) is a quadratic function in a convex domain, it 
has a single minimum [21], irrespective of the kind of rule that is being learned. Therefore, 
we may safely assume that all the replicas are equivalent, even in the case of learning 
unrealizable rules. We obtain thus the typical properties for cases where, using other more 
usual cost functions like the number of training errors, full replica symmetry breaking would 
be required [19]. The excellent agreement of the theoretical predictions and the numerical 
simulations presented in the following section is a further justification of our hypothesis of 
replica symmetry. Thus, we set $Q_a = Q$ and $q_{ab} = q$, take $R_a$ independent of the replica index, and define the normalized
overlap $R = R_a/\sqrt{Q}$, which only depends on the angle between $\mathbf{w}$ and $\mathbf{w}_0$.

Due to the uniqueness of the soft margin solution, only one point in phase space has non
vanishing probability in the limit $\beta \to \infty$, so that $q \to Q$. It is convenient to introduce a
new parameter, $x = \beta(Q - q)$, which reflects how fast the fluctuations around the minimum
of (5) vanish as $\beta \to \infty$. In this limit we obtain the typical free energy of the SMC learning
a rule defined by the polynomial $\mathcal{P}$,

$$f_{C,k}(\mathcal{P}) = -\mathrm{extr}_{Q,R,x}\left\{ G_0(Q, R, x) - \alpha\, G_{C,k}(Q, R, x; \mathcal{P}) \right\}, \qquad (12)$$

where 

$$G_0(Q, R, x) = \frac{Q}{2x}\left(1 - R^2 - x\right), \qquad (13)$$

is an entropic term. The dependence on the rule to be learned is embodied in the second 
term of (12) through $\mathcal{P}(z)$, and on the learning algorithm through $k$ and $C$. Integrating out
the slack variables in the limit $\beta \to \infty$ through a saddle point approximation, we get

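$$G_{C,k}(Q, R, x; \mathcal{P}) = \int_{-\infty}^{\infty} Dy \int_{\phi(y;Q,R,\mathcal{P})}^{\infty} Dt\; \min_{\zeta} W(\zeta; y, t, Q, R, x, \mathcal{P}), \qquad (14)$$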

where $Dt = dt\,\exp(-t^2/2)/\sqrt{2\pi}$,

4){y] Q, R, V) = ^l-E? ' ^ ' 

and 

W{C, y, t, g, R, x,V) = CC'+ ^ ^ ^ (16) 

In (14), due to the saddle point approximation, $W(\zeta; y, t, Q, R, x, \mathcal{P})$ has to be taken at its
minimum $\zeta(t, y) \in [0, \sqrt{Q(1 - R^2)}\,\phi(y; Q, R, \mathcal{P})]$ for each pair $(y, t)$. It is easy to see that
there is a unique local minimum inside this interval for $k > 1$. For $k = 1$, $W$ is a quadratic
function of $\zeta$, whose global minimum falls inside the allowed interval only for a finite range of
values of $t$. Outside this range, the minimum lies at the boundary $\zeta = 0$. As a consequence,
for $k = 1$ the inner integral in $G_{C,k}$ splits into two parts. The results for $k = 1$ and $k = 2$
are respectively:

GcAQ,R,^-,r) = Dt^^^^^g{t-R,V) + J^DtC{t^+l-^)g(t;R,V) (17) 

GcAQ, R, ^; r) = Dt ^^^^^lc' ^' ^^^^ 

with 

g{t,R,V)-J ^2,^^ _ -P 1^ 2(1 -R^) J" ^^'^ 

Differentiating the free energy (12) with respect to $Q$, $R$ and $x$ gives three coupled equations
for the order parameters. These, in turn, determine the properties of the SMC. The explicit
expression of the saddle point equations for $k = 1$ and $k = 2$ is left to the Appendix, where
we also derive some general properties of the learning curves described in the next sections.

The generalization error $\epsilon_g$, which is the probability of misclassification of any pattern
drawn with probability (7), is a geometric property that depends only on $R$ and the rule to
be learnt. In the case of rules of type (8), it is straightforward to obtain



where $H(x) = \int_x^{\infty} Dt$. In the particular case of a linearly separable rule $\mathcal{P}(z) = z$, (20)
reduces to the usual expression $\epsilon_g = \arccos(R)/\pi$.
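The linear-rule expression $\epsilon_g = \arccos(R)/\pi$ can be checked numerically with a short Monte Carlo sketch. It assumes, as an illustration, that the teacher's field $y$ and the student's field $s$ are unit Gaussians with correlation $R$, written as $s = R\,y + \sqrt{1 - R^2}\,u$ with $u$ an independent unit Gaussian; the function names are hypothetical.

```python
import math
import random


def eps_g_linear(R):
    """Generalization error of a perceptron with overlap R on a linear teacher."""
    return math.acos(R) / math.pi


def eps_g_monte_carlo(R, n_samples=100_000, seed=0):
    """Fraction of random inputs on which student and teacher outputs differ."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - R * R)
    errors = 0
    for _ in range(n_samples):
        y = rng.gauss(0.0, 1.0)            # teacher field w0 . x
        u = rng.gauss(0.0, 1.0)            # component orthogonal to the teacher
        s = R * y + c * u                  # student field, correlation R with y
        if (y > 0) != (s > 0):
            errors += 1
    return errors / n_samples


print(eps_g_linear(0.8), eps_g_monte_carlo(0.8))
```

The two numbers should agree within the statistical error of the sample, which is of order $10^{-3}$ for the default sample size.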

The distribution of stabilities $\gamma_\mu = h_\mu/\|\mathbf{w}\|$ of the training patterns, $\rho(\gamma)$, is given by

= e(, - ^)^<,(-r, ^'^l^"*-^ + '^-^^ ^) 



The training error $\epsilon_t$ is the average fraction of classification errors on the training patterns.
Integrating (21) over the negative stabilities we obtain

e. = lmH ^«sign(P(t))+faC/Vg\ 

As expected, the training error is always strictly smaller than the generalization error. Both
converge to the same limit for $\alpha \to \infty$.



III. LEARNING CURVES 

In this section we present the learning curves, namely the training error $\epsilon_t(\alpha)$ and the
generalization error $\epsilon_g(\alpha)$ of the SMCs for different teacher rules. We include in the figures
the learning curves of the corresponding hard margin SVMs, or MSP, determined within
the hypothesis of replica symmetry. In the case of unrealizable rules it is well known that
the replica symmetry is broken for $\alpha$ larger than $\alpha_{MSP}$, the fraction of training patterns
at which the hard margin $\kappa_{max}$, positive for $\alpha < \alpha_{MSP}$, vanishes. The results of computer
simulations drawn on the same figures have been obtained by solving numerically the dual
problem [13] using the Quadratic Optimizer for Pattern Recognition program [22], that we
adapted to the case without threshold treated in this paper. The average has been taken
over as many training sets as necessary (typically $\sim 500$ for small $\alpha$ and $\sim 50$ for big $\alpha$) to
ensure that the error bars are smaller than the symbols. These simulations are in excellent
agreement with the theoretical predictions. 
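A small-scale simulation of this kind can be sketched as follows. Note that this is not the dual quadratic programming method used for the figures: plain gradient descent on the primal cost (5) with $k = 2$ and no threshold is swapped in, because that cost is differentiable once the slacks are set to $\max(0, 1 - h_\mu)$. The sizes, the seed and the function names are hypothetical choices made only for illustration, and a single small training set cannot be compared quantitatively with the thermodynamic-limit curves.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def train_smc_k2(X, tau, C, n_iter=1000):
    """Gradient descent on w.w/2 + C * sum_mu max(0, 1 - h_mu)^2 (no bias)."""
    N = len(X[0])
    # Safe step size from an upper bound on the curvature: 1 + 2C * sum_mu |x_mu|^2
    step = 1.0 / (1.0 + 2.0 * C * sum(dot(x, x) for x in X))
    w = [0.0] * N
    for _ in range(n_iter):
        grad = list(w)                                  # gradient of w.w/2
        for x, t in zip(X, tau):
            zeta = 1.0 - t * dot(w, x)                  # slack of this pattern if positive
            if zeta > 0.0:
                for i in range(N):
                    grad[i] -= 2.0 * C * zeta * t * x[i]
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w


rng = random.Random(1)
N, M, C = 10, 50, 1.0                                   # alpha = M/N = 5
w0 = [rng.gauss(0.0, 1.0) for _ in range(N)]            # teacher direction
X = [[rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)] for _ in range(M)]
tau = [1 if dot(w0, x) >= 0 else -1 for x in X]         # linear teacher

w = train_smc_k2(X, tau, C)
R = dot(w, w0) / math.sqrt(dot(w, w) * dot(w0, w0))     # normalized overlap
eps_g = math.acos(R) / math.pi                          # linear-rule generalization error
eps_t = sum(t * dot(w, x) < 0 for x, t in zip(X, tau)) / M
print(R, eps_g, eps_t)
```

Averaging `eps_g` and `eps_t` over many independent training sets, and repeating for several values of `M`, would give an empirical learning curve at fixed `C`.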



A. The linear rule 



Introducing the expression $\mathcal{P}(z) = z$ corresponding to a linearly separable teacher's rule
in (19), we obtain:

$$g(t; R, \mathcal{P}) = 2H\left(-\frac{Rt}{\sqrt{1 - R^2}}\right). \qquad (23)$$

The training and generalization errors, obtained after solving the extremum equations
for different values of the hyperparameter $C$, are plotted against $\alpha$ on Figures 1 and 2 for
$k = 1$ and $k = 2$ respectively. The generalization error of the hard margin classifier, solution
of (3) with conditions (4), and that of the optimal Bayesian generalizer [23], which are both
error-free solutions, are included on the figures for comparison. Despite the fact that the
task is realizable by the student perceptron, the training error for finite $C$ is finite. It goes
through a maximum and vanishes asymptotically in the limit $\alpha \to \infty$. As expected, both
for $k = 1$ and $k = 2$ at any $\alpha$, $\epsilon_t$ is larger the smaller the value of $C$, which controls the
relative importance of the error term in the cost function (5). We can also see from the
figures that, given $C$, the machine with $k = 2$ performs better than the one with $k = 1$. On
increasing $C$, the learning curves approach those of the MSP. In fact, by taking the limit
$C \to \infty$ in our saddle point equations we get exactly the equations of the MSP for every
value of $\alpha$, independently of the power $k$. This is not surprising, as in this limit the error
term dominates completely the soft margin cost function (5), which can only be minimized if
all the slack variables, and consequently the training error, vanish. This is possible because
the rule is realizable. It is well known that the generalization error of the MSP is larger than
that of the Bayesian generalizer even asymptotically, as for $\alpha \to \infty$ both algorithms have
$\epsilon_g \sim a/\alpha$, but $a = 0.5005$ in the case of the MSP [24], whereas $a = 0.442$ for the Bayesian
perceptron [23].

The obtained behaviour of the learning curves at finite C is reminiscent of that arising 
with other learning algorithms having a hyperparameter. In the inconsistent algorithms 
studied by Meir and Fontanari [6], patterns closer to the hyperplane than a finite imposed
distance $\kappa > \kappa_{max}$ contribute to the cost, linearly in the case of the perceptron algorithm
and quadratically in the case of the relaxation one. In the algorithm Minimerror [24] the
hyperparameter is equivalent to a learning temperature. By training with these algorithms,
as well as with the SMC studied here, the generalization error can be made smaller than
that of the MSP by choosing appropriate values for the hyperparameters, at the price of
learning with errors. The reason is that, in contrast with the MSP, the Bayesian solution
presents a finite fraction of training patterns at any distance of the hyperplane [8]. Thus,
solutions with a small controlled fraction of training errors may be closer to the optimal
Bayesian hyperplane than the MSP, which has no patterns at distances smaller than $\kappa_{max}$.

Unlike the generalization error of the inconsistent learning algorithms, that vanishes
asymptotically like $\epsilon_g \sim 1/\sqrt{\alpha}$ [6], SMCs with finite $C$ present a faster power law decay:

$$\epsilon_g \simeq \frac{c_k}{\alpha^{2/3}}, \qquad (24)$$

where the constant $c_k$ is larger for $k = 1$ than for $k = 2$. In the limit $C \to \infty$ eq. (24)
no longer holds, and the well known decay $\epsilon_g \sim \alpha^{-1}$ characteristic of error-free trained
perceptrons learning realizable tasks is recovered.
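The exponent of such a power law can be read off a learning curve as the negative slope of $\ln \epsilon_g$ against $\ln \alpha$. The sketch below fits this slope by least squares; the data are synthetic, generated with an arbitrary prefactor purely to exercise the fit, and the function name is hypothetical.

```python
import math


def decay_exponent(alphas, eps):
    """Least-squares estimate of nu in eps ~ alpha**(-nu), from a log-log fit."""
    xs = [math.log(a) for a in alphas]
    ys = [math.log(e) for e in eps]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope


# Synthetic curve with nu = 2/3 and an arbitrary prefactor (illustration only)
alphas = [10.0 * 2 ** i for i in range(8)]
eps = [0.5 * a ** (-2.0 / 3.0) for a in alphas]
print(decay_exponent(alphas, eps))
```

On real data the fit should be restricted to the largest available values of $\alpha$, since the power law only describes the asymptotic regime.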

Independently of the value of $C$, both the regularization term, proportional to $Q$, and
the slacks term diverge like $\sim \alpha^{2/3}$ for $\alpha \to \infty$. In fact, this divergence arises because
we divided the free energy in (10) by $N$, instead of dividing by $N(1 + \alpha)$, which gives the
energy per degree of freedom. In the large $\alpha$ limit, this converges to 0 as it should, like
$\alpha^{-1/3}$. The separable case is the only one where the error term in the cost function presents
the same asymptotic behaviour as the regularization term. In this limit, the soft margin
$1/\sqrt{Q}$ vanishes like $\alpha^{-1/3}$, in contrast with the hard margin behaviour, $\kappa_{max} \sim \alpha^{-1}$ [24].



B. The shifted linear rule 

Next we analyze the case of a linear teacher with a bias $\delta > 0$. The corresponding
polynomial has a single root: $\mathcal{P}(z) = z - \delta$. This teacher separates linearly the examples
with a hyperplane at a distance $\delta$ from the origin. As the student perceptron has no bias
($b = 0$), zero generalization error cannot be achieved: this rule is unrealizable. The lowest
value of $\epsilon_g$, obtained by taking the asymptotic limit $R \to 1$ in (20), is $0.5 - H(\delta)$.
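This residual error is easy to evaluate numerically. The sketch below uses the identity $H(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$ for the Gaussian tail integral $H(x) = \int_x^\infty Dt$; the function names are hypothetical.

```python
import math


def H(x):
    """Gaussian tail integral: H(x) = int_x^infinity Dt = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def residual_error_shifted(delta):
    """Lowest generalization error of an unbiased student on the teacher z - delta."""
    return 0.5 - H(delta)


print(residual_error_shifted(0.3))
```

For a shift of 0.3 the residual error is about 0.118: it is the Gaussian weight of the slab between the student's hyperplane through the origin and the teacher's shifted hyperplane.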

The function g defined by (19) is: 

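$$g(t; R, \mathcal{P}) = H\left(-\frac{Rt + \delta}{\sqrt{1 - R^2}}\right) + H\left(-\frac{Rt - \delta}{\sqrt{1 - R^2}}\right). \qquad (25)$$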

Learning curves for different values of $C$ are represented as a function of $\alpha$ on Figure 3, for
the particular value $\delta = 0.3$. The training error of the MSP is zero up to $\alpha_{MSP}$, at which the
maximal stability $\kappa_{max}$ vanishes. $\alpha_{MSP}$ is a decreasing function of $\delta$. It diverges at $\delta = 0$, as
the problem becomes separable, and tends to the perceptron's capacity $\alpha_c = 2$ in the infinite
$\delta$ limit. $\alpha_{MSP}$ cannot be smaller than $\alpha_c$ since in the thermodynamic limit any training set
can be learned without errors for $\alpha < \alpha_c$ [25]. Within the replica symmetry hypothesis, the
MSP's training error displays a discontinuous transition at $\alpha = \alpha_{MSP}$. For $\alpha > \alpha_{MSP}$,
$\epsilon_t = \epsilon_g$. The generalization error does not present any singularity at $\alpha = \alpha_{MSP}$. As already
mentioned, $\kappa_{max}$ becomes negative for $\alpha > \alpha_{MSP}$, and the cost function $E_{MSP}$ (1) is likely
to present several disconnected minima. Thus, the hypothesis of replica symmetry used to
draw the MSP's learning curves in Figure 3 is most probably wrong.

If we take the limit $C \to \infty$ in our equations, we get those corresponding to the MSP
only for $\alpha < \alpha_{MSP}$. At $\alpha_{MSP}$, the training error of the SMC starts increasing and the
generalization error curve detaches down from that of the MSP, both through a second
order phase transition. The learning curves obtained in the limit $C \to \infty$ are different
for $k = 1$ and $k = 2$, in contrast with the realizable rule considered before, in which they
converge to that of the MSP irrespective of the value of $k$.

For finite values of $C$ the transition at $\alpha_{MSP}$ becomes a crossover both for $\epsilon_t$ and $\epsilon_g$, at
values of $\alpha < \alpha_{MSP}$ that decrease on decreasing $C$. The training error for all $\alpha$ is larger
than that for infinite $C$, both for $k = 1$ and $k = 2$. The generalization errors for different
values of $C$ cross each other as a function of $\alpha$. The envelope of the curves $\epsilon_g(\alpha)$ corresponds
to the lowest possible value of $\epsilon_g$ reachable by the corresponding SMCs. It depends on the
exponent $k$. Notice that for large enough values of $\alpha$ the replica symmetric approximation
to the MSP's generalization error is smaller than that of the optimal soft margin solutions,
and seems to provide a lower bound to $\epsilon_g$ for the SMCs.

The convergence of the generalization error to its asymptotic limit $\epsilon_\infty$, for all values of $C$,
is exponentially fast with $\alpha$:

$$\epsilon_g - \epsilon_\infty \sim c\,\exp\left(-\frac{\alpha}{\alpha_0}\right). \qquad (26)$$
The decay constant $\alpha_0$ does not depend on $C$. A stronger exponential drop of the generalization
error, with in the exponent, has been found for SVMs learning "easy" teacher
rules. These not only are realizable, but present a gap in the patterns distribution close
to the discriminating surface. In contrast, here the student's hyperplane is surrounded by
unlearnable patterns. The student cannot get rid of the errors by decreasing the soft margin,
as it can with the linear rule. On increasing $\alpha$, $Q$ converges to a constant that depends on $k$ and $\delta$
while the error term in (5) increases with $\alpha$. For large enough $\alpha$, the cost function is mainly
dominated by the error term, and then $C$ only plays the role of an irrelevant multiplicative
constant. This is why the convergence rate to the asymptotic value of the generalization
error does not depend on $C$.

Similar results are obtained for $k = 2$, as is shown on Figure 4.

C. Sandwich Rule 

Consider now rules of the form $\mathcal{P}(z) = z(z - \delta)$, where the polynomial defining the
teacher's output has two roots. The corresponding discriminating surfaces are two parallel
hyperplanes, one containing the origin and the other at a distance $\delta/\sqrt{N}$ from it. The patterns
lying between the hyperplanes belong to class $+1$, the others to class $-1$. Thus, not
only are these unrealizable rules, but the classification errors will necessarily correspond to
patterns at large distance from the student's hyperplane.

As with all the unrealizable rules, the training error of the MSP within the replica
symmetric approximation presents a discontinuity at $\alpha_{MSP}$, where $\kappa_{max}$ vanishes. Here
$\alpha_{MSP}$ is an increasing function of $\delta$, starting at $\alpha_{MSP} = 2$ for $\delta = 0$, which corresponds to
the most difficult learning task, and diverging for $\delta \to \infty$. The generalization error starts
decreasing at small $\alpha$, reaches a minimum beyond $\alpha_{MSP}$ and then starts to increase, and
tends asymptotically to $1/2$ for $\alpha \to \infty$. Notice however that for $\alpha > \alpha_{MSP}$ the replica
symmetry is most probably broken.

The properties of the SMC are obtained by replacing 

in the saddle point equations (31-33) of the Appendix. 

The learning curves for different values of the hyperparameter $C$, corresponding to a
width $\delta = 2$, are represented on Figures 5 and 6 for $k = 1$ and $k = 2$ respectively. Given
$C$, for large enough $\alpha$, the training error curves $\epsilon_t(\alpha)$ for $k = 1$ are below those for $k = 2$.
This is so because the unavoidable errors, which are very far from the hyperplane, are more
heavily penalized if $k = 2$. Thus, the SMC tries to learn these examples even if this increases
the overall number of errors. As a result, learnable patterns close to the hyperplane, that
have small slacks, are incorrectly classified. This can be checked by looking at the
distribution of stabilities, Figure 7.

As with the previous shifted linear rule, the norm of the student's weight vector $Q$
tends to a constant value and therefore the error term dominates the cost function in the
asymptotic limit $\alpha \to \infty$. However, instead of the exponential convergence, the generalization
error decays asymptotically to $H(\delta)$ like $\alpha^{-1/2}$. The reason for this difference is
discussed in section V.
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D. The Reversed Wedge 



Teachers defined by third order polynomials like $V(z) = z(z - \delta)(z + \delta)$ with $\delta > 0$
correspond to the so-called Reversed Wedge [26] rules. Patterns with $\mathbf{w}_0 \cdot \boldsymbol{\xi} \in (-\infty, -\delta) \cup (0, \delta)$
belong to class $-1$, those outside this subspace to class $+1$. The generalization properties of
a perceptron learning a reversed wedge teacher have been addressed in [26], and within the
on-line paradigm, using Hebb's learning rule, in [27].
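The class assignment just described can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the teacher field $z$ is assumed to be a unit Gaussian variable, as for normalized Gaussian patterns.

```python
import random


def reversed_wedge_label(z, delta):
    """Class given by the sign of V(z) = z (z - delta) (z + delta).

    z is the field of the pattern along the teacher's vector (hypothetical name).
    """
    v = z * (z - delta) * (z + delta)
    return 1 if v > 0 else -1


def sample_labels(n, delta, seed=0):
    """Draw n unit-Gaussian teacher fields (an assumption) and label them."""
    rng = random.Random(seed)
    fields = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [(z, reversed_wedge_label(z, delta)) for z in fields]
```

Since $V$ is odd in $z$, a symmetric field distribution puts half of the patterns in each class for every $\delta$, which is a quick sanity check on simulated training sets.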

The behaviour of the replica symmetric approximation to the MSP is as described in the
previous section, but here $\alpha_{MSP}$ diverges both in the limits of vanishing and infinite wedge
width $\delta$, for which the problem becomes separable, and has a minimum at $\delta_c = \sqrt{2 \ln 2}$ [26].
At this value of $\delta$ the patterns' stability distribution along the teacher's weight $\mathbf{w}_0$ has zero
mean. Correspondingly, learning becomes impossible for the MSP, as is discussed later.
Thus, for $\delta_c$, $R = 0$ for every value of $\alpha$, and $\alpha_{MSP} = 2$ is equal to the perceptron's capacity.

The properties of the SMCs are deduced after insertion of 

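$$g(R,t,\delta) = 2\left[ H\!\left(-\frac{tR+\delta}{\sqrt{1-R^2}}\right) - H\!\left(-\frac{tR}{\sqrt{1-R^2}}\right) + H\!\left(-\frac{tR-\delta}{\sqrt{1-R^2}}\right) \right] \qquad (28)$$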

into the saddle point equations. 

In contrast with the problems considered before, the generalization error of a perceptron
learning the reversed wedge rule is a monotonic function of $R$ only if $\delta > \delta_c$ [27]. For
$0 < \delta < \delta_c$, $\epsilon_g(R)$ presents a relative minimum at $R_{min} > 0$ and a corresponding maximum
at $-R_{min}$. The relative minimum is the global one only for $0 < \delta < \delta^* = 0.570185$. At $\delta^*$
the global minimum jumps to $R = -1$, and for $\delta^* < \delta < \delta_c$ the generalization error takes
its smallest value at $R = -1$. At $\delta = \delta_c$ the relative extrema collapse at the inflexion point
$R_{min} = 0$, and for larger values of $\delta$ the generalization error becomes a monotonic increasing
function of $R$. This behaviour is represented on Figure 8.
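The non-monotonic dependence of the generalization error on the overlap can be explored numerically. The sketch below is not taken from the paper: it assumes the usual teacher-student picture in which the teacher field $z$ and the student field $y$ are unit Gaussians with correlation $R$, and integrates the probability that $\mathrm{sign}(y)$ disagrees with $\mathrm{sign}(V(z))$ with a plain trapezoidal rule. The function names and the quadrature parameters are arbitrary choices.

```python
import math


def H(x):
    """Gaussian tail: integral of Dz from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def eps_g_reversed_wedge(R, delta, n=2000, ymax=8.0):
    """Disagreement probability between sign(y) and sign(V(z)) for |R| < 1,
    with V(z) = z (z - delta) (z + delta) and corr(z, y) = R (assumed setup)."""
    s = math.sqrt(1.0 - R * R)  # std of z conditioned on y
    h = ymax / n
    total = 0.0
    for i in range(n + 1):
        y = i * h
        # P(teacher class = +1 | y): z in (-delta, 0) or (delta, inf), z|y ~ N(R y, s^2)
        p_plus = (H((delta - R * y) / s)
                  + H(-(delta + R * y) / s)
                  - H(-R * y / s))
        weight = 0.5 if i in (0, n) else 1.0  # trapezoidal end-point weights
        total += weight * (1.0 - p_plus) * math.exp(-0.5 * y * y)
    # the half-line y < 0 gives the same contribution because V is odd
    return 2.0 * total * h / math.sqrt(2.0 * math.pi)


def best_overlap(delta):
    """Scan a grid of overlaps and return (R, eps_g) with the smallest error."""
    grid = [i / 100.0 for i in range(-99, 100)]
    return min(((R, eps_g_reversed_wedge(R, delta)) for R in grid),
               key=lambda pair: pair[1])
```

For a narrow wedge such as $\delta = 0.3$ the scan should return an interior minimum at $0 < R < 1$, while for $\delta = 2$ the error grows with $R$ over the whole grid.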

For the values of $k$ investigated, $R$ has two distinct behaviors as a function of $\alpha$,
depending on the wedge's width $\delta$. If $\delta < \delta_c$, the teacher's average stability is positive, and
$R(\alpha)$ is a monotonic continuous function growing from 0 to its asymptotic value $+1$. In
this range of small wedges, unlike in the tasks considered before, the soft margin learning
algorithm does not converge to the minimal value of the generalization error in the limit of
infinite $\alpha$. In fact it "overshoots", in the sense that $R(\alpha)$ continues to grow
beyond the value that optimizes the generalization performance. Correspondingly, $\epsilon_g(\alpha)$
goes through a minimum at finite $\alpha$ but, as $R$ increases with $\alpha$, it converges to a larger
value, $\epsilon_g^{\infty} = \epsilon_g(R = 1)$. The learning curves of Figure 9 are an example of this behaviour.
Notice that for $\delta^* < \delta < \delta_c$ this value of $\epsilon_g^{\infty}$ corresponds to the largest value of the student's
generalization error. Moreover, for $0.67449 < \delta < \delta_c$ the asymptotic behaviour is even worse
than a random guess, because $\epsilon_g(R = 1) > 0.5$.
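The thresholds quoted above can be checked from the two limiting errors. A student aligned with the teacher ($R = 1$) disagrees with it exactly when $|z| < \delta$, and an anti-aligned one ($R = -1$) when $|z| > \delta$, so the errors are $1 - 2H(\delta)$ and $2H(\delta)$ respectively. The short sketch below (function names are illustrative) evaluates them with the complementary error function.

```python
import math


def H(x):
    """Gaussian tail: integral of Dz from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def eps_aligned(delta):
    """Generalization error of a student with R = +1 on the reversed wedge."""
    return 1.0 - 2.0 * H(delta)


def eps_antialigned(delta):
    """Generalization error of a student with R = -1 on the reversed wedge."""
    return 2.0 * H(delta)


delta_c = math.sqrt(2.0 * math.log(2.0))  # wedge width of vanishing mean stability
```

The aligned student does worse than random guessing once $H(\delta) < 1/4$, that is for $\delta > 0.67449$, which is the lower bound quoted in the text.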

At $\delta = \delta_c$ there is an abrupt change of the learning behaviour, as beyond this wedge
width the average teacher's stability is negative, and $R$ becomes a decreasing function of $\alpha$.
Correspondingly, the soft margin solution converges to the optimal generalizer in the limit
$\alpha \to \infty$. This corresponds to $R = -1$, because for large $\delta$ most of the patterns lie inside
the reversed wedge, so that the student's weight vector tends to orient antiparallel to the
teacher's vector $\mathbf{w}_0$, in order to classify correctly most of the examples. Learning curves for
$\delta = 2 > \delta_c$, obtained with exponent $k = 1$ for the slacks in the cost function, are
represented on Figure 10.

As for the sandwich rule, the generalization error decays as $\alpha^{-1/2}$ to the corresponding
asymptotic values, $\epsilon_g^{\infty} = 1 - 2H(\delta)$ for $R \to 1$, and $\epsilon_g^{\infty} = 2H(\delta)$ for $R \to -1$. The same
asymptotic behaviors for $\epsilon_g$ and $R$, but with different prefactors, were obtained by Inoue et
al. [27] for the online Hebbian learning scenario.

The asymptotic value of $Q$ tends to zero as $\delta$ tends to $\delta_c$. In the two limiting cases
$\delta \to \infty$ and $\delta \to 0$, the task becomes linearly separable and correspondingly $Q \to \infty$.

For the particular case of $\delta = \delta_c$ the only solution of the saddle point equations is $R = 0$
for every value of $\alpha$. This "no learning" regime is discussed in section V.



IV. OPTIMIZATION OF THE HYPERPARAMETER 

The figures of the preceding section show that the behavior of the generalization error
of the SMC is not monotonic with $C$. It can be seen that there is an optimal value $C_{opt}(\alpha)$
that gives the minimum generalization error for each $\alpha$. Obviously, $C_{opt}$ cannot
be calculated using the training examples alone, so that in applications it can only be
estimated. Several methods for doing this have been proposed recently [28,29]. Here we
determine the statistical properties of the optimal SMC, thus providing reference curves
against which results obtained using the different estimators may be tested.

As $\epsilon_g$ depends implicitly on $C$ through $R$, in the cases where $\epsilon_g$ is a monotonic function
of $R$ its minimum is obtained by looking for the extremum of $R$ with respect to $C$ at fixed
$\alpha$, which defines $C_{opt}(\alpha)$. To this end, the three saddle point equations (31-33) of the Appendix, together
with their derivatives with respect to $C$, constitute a system of 6 coupled equations for the
variables $Q$, $R$, $x' = xC$, $dQ/dC$, $dR/dC$ and $dx'/dC$. Setting the extremum condition
$dR/dC = 0$, the equations obtained by derivation of (32) and (33) form a homogeneous
system for $dQ/dC$ and $dx'/dC$. The only nontrivial solution is obtained by setting the
determinant of this system to zero, which gives

$$\frac{\partial^2 f}{\partial Q\,\partial R}\,\frac{\partial^2 f}{\partial x'^2} - \frac{\partial^2 f}{\partial R\,\partial x'}\,\frac{\partial^2 f}{\partial Q\,\partial x'} = 0, \qquad (29)$$

where $f$ stands for the free energy (12). Solving the system given by equation (29) together
with the three original saddle point equations for $Q$, $R$, $x'$ and $C$, we get $C_{opt}$ in the cases
where $\epsilon_g$ is a monotonic function of $R$.

In the other cases, as happens with the Reversed Wedge rule, determining $C_{opt}$ is less
straightforward because the minimum of $\epsilon_g$ may be reached for a value $R^*$ (different from
$\pm 1$) such that $d\epsilon_g/dR\,(R^*) = 0$, with $dR/dC \neq 0$. In that case, $C_{opt}$ is the one that gives
$R(C_{opt}) = R^*$, and has to be determined numerically.
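One simple way to carry out such a numerical determination is bisection on $C$. The sketch below is generic and not the procedure used in the paper: `overlap` stands for any routine that returns $R(C)$ at fixed $\alpha$ (for instance a solver of the saddle point equations, which is not reproduced here), the bracket $[C_{lo}, C_{hi}]$ is assumed to contain a sign change of $R(C) - R^*$, and `toy_overlap` is a made-up stand-in used only to exercise the code.

```python
def solve_for_C(overlap, R_star, C_lo, C_hi, tol=1e-10, max_iter=200):
    """Bisection for the C in [C_lo, C_hi] at which overlap(C) equals R_star."""
    f_lo = overlap(C_lo) - R_star
    f_hi = overlap(C_hi) - R_star
    if f_lo * f_hi > 0:
        raise ValueError("R(C) - R_star does not change sign on the bracket")
    for _ in range(max_iter):
        C_mid = 0.5 * (C_lo + C_hi)
        f_mid = overlap(C_mid) - R_star
        if f_lo * f_mid <= 0:
            C_hi = C_mid          # root lies in the lower half
        else:
            C_lo, f_lo = C_mid, f_mid  # root lies in the upper half
        if C_hi - C_lo < tol:
            break
    return 0.5 * (C_lo + C_hi)


def toy_overlap(C):
    """Hypothetical monotonic R(C), NOT the saddle point solution."""
    return C / (1.0 + C)


C_example = solve_for_C(toy_overlap, 0.8, 1e-3, 1e3)
```

Bisection only needs evaluations of $R(C)$, which is convenient when each evaluation already involves solving three coupled equations.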

The optimal generalization curves for the different rules considered in this paper are
represented on the figures of the preceding section. Notice that for $\alpha < \alpha_{MSP}$, the MSP
is not optimal for any value of $\alpha$, as it is obtained in the limit $C \to \infty$. In the case
of the realizable linear separation, the optimal generalization error of the SMC vanishes
asymptotically as $0.488\,\alpha^{-1}$ for $k = 1$, and as $0.449\,\alpha^{-1}$ for $k = 2$. The latter is very close to
that of the Bayesian perceptron, $0.442\,\alpha^{-1}$, but the curves are also very close for finite values
of $\alpha$, as can be seen on Figure 2. Notice that the asymptotic decay of $\epsilon_g$ for the SMC is faster
than that of the MSP, even for $k = 1$. This is an interesting result, as it shows that, even
when a hard margin solution exists, learning with a soft margin machine yields
better classifiers.

For the non separable cases, even if $C_{opt}$ gives the best performance at finite
$\alpha$, since the asymptotic behavior of $\epsilon_g$ is independent of $C$, all the learning curves, including
the optimal one, tend to a value that only depends on the rule and on $k$, as shown in the
corresponding sections.

The evolution of $C_{opt}$ with $\alpha$ can be seen on Figures 12 to ??. The behaviour of the
curves is qualitatively similar for the shifted linear rule and the reversed wedge with small
$\delta$ on one hand, and for the sandwich rule and the reversed wedge with large $\delta$ on the other.
The divergences of $C_{opt}$ are related to the presence of errors with unbounded slack values.
For $\alpha$ beyond the divergence, $C_{opt} = \infty$.



V. DISCUSSION

In the preceding sections we presented the learning curves of a SMC learning a variety
of rules, characterized by an anisotropy axis parallel to the teacher's vector $\mathbf{w}_0$. Some of
the obtained results, and in particular the asymptotic behaviour in the $\alpha \to \infty$ limit, can
be generalized to other teacher rules (proofs are detailed in the Appendix). As shown by
Reimann and Van den Broeck [30], it is useful to characterize the teacher rules by the average
patterns' stability $\langle\gamma\rangle$ of a perceptron aligned with the teacher's vector,

(30)

where the second equality in (30) stems from our assumption (7) that the patterns'
distribution is Gaussian.

In the Appendix we show that in the limit $\alpha \to \infty$, both for $k = 1$ and $k = 2$, $R$
converges asymptotically either to 1 or to $-1$, that is, the student perceptron gets either
completely aligned or completely anti-aligned with the teacher's vector. Furthermore, for non
separable rules, $1 - R^2 \sim 1/\alpha$. In this limit of $R \to \pm 1$ we find $\epsilon_g^{\infty} = \int Dz\, \Theta(\mp z V(z))$,
irrespective of the teacher's rule. The convergence law to this asymptotic value depends on
whether or not the polynomial $V(z)$ defining the rule in (8) has a root $z_i = 0$. If 0 is not a
root of $V(z)$, $V(0) \neq 0$ and $\epsilon_g - \epsilon_g^{\infty} \sim \exp(-c/(1 - R^2))$ with $c$ a constant, whereas if 0 is a
root, then the decay follows the law $\epsilon_g - \epsilon_g^{\infty} \sim \sqrt{1 - R^2}$.

Thus, for the unrealizable rules that have 0 as one of the roots of $V$, the generalization
error decays to the asymptotic value as $\epsilon_g - \epsilon_g^{\infty} \sim \alpha^{-1/2}$. A similar result has been obtained
by Amari et al. [31] within the annealed approximation for the case of a deterministic
machine learning a noisy teacher, and by other authors for Hebbian learning of unrealizable
tasks [27,4]. The same power law has been obtained by Meir and Fontanari [6] for a realizable
problem learned with inconsistent algorithms, within the approximation of replica symmetry,
which is probably not valid for large values of $\alpha$. Indeed, the soft margin algorithm with
finite $C$ is also inconsistent when the rule is the linear separation considered in section III A,
and in that case we obtain a different power law decay.

In the case of a linearly separable rule, the SMC with $C_{opt}$ has $\epsilon_g \sim 1/\alpha$, like the MSP,
which corresponds to $C = \infty$. However, at fixed finite values of $C$ the decay is slower, like
$\epsilon_g \sim 1/\alpha^{2/3}$. The same exponent has been obtained for a perceptron learning a separable rule
using noisy examples with one step of replica symmetry breaking [18]. Within the replica
symmetric approximation to the same problem the exponent is 1/2 instead of 2/3 [2].

In the cases where 0 is not a root of $V(z)$, like for the shifted linear rule, the decay is
exponential, $\epsilon_g - \epsilon_g^{\infty} \sim \exp(-c\,\alpha)$. A similar behaviour was found in [4] for a perceptron
with linear output and binary weights, trying to learn examples given by a teacher with the
same structure, where the generalization error vanishes exponentially.

The presence or the absence of a root $z_i = 0$ induces different asymptotic behaviors
because if 0 is a root, then a student perceptron aligned with the teacher has $|R| = 1$
and can perfectly separate the patterns closest to the hyperplane. In that case, any small
misalignment modifies the classification induced by the student, thus strongly modifying
the error term in the cost function. On the other hand, if 0 is not a root, the student's
hyperplane is immersed in a sea of patterns of the same class. Small tilts of the hyperplane
do not change significantly the classification nor the slacks term in the cost.

It is interesting to notice that the figures of the learning curves as well as those of $C_{opt}$
show an analogy between the behaviour for the SMCs with bounded slacks, like in the case
of the shifted linear rule and that of the reversed wedge when $\delta < \delta_c$, and between those
with unbounded slacks, as is the case with the sandwich rule and the reversed wedge when
$\delta > \delta_c$. For this last type of rules, $C_{opt}$ diverges beyond some finite $\alpha$.

Consider now the small $\alpha$ limit. As shown in the Appendix, $R \sim \langle\gamma\rangle \sqrt{\alpha}$ and so
$\epsilon_g \sim 1/2 - \langle\gamma\rangle^2 \sqrt{\alpha}$. Thus, irrespective of the rule considered, when the fraction of
training examples is small, the SMC generalizes better than random guessing. This is not
necessarily the case for larger values of $\alpha$.

If we put $R = 0$ in the equations, and solve for $\alpha$, the only possible solution when $\langle\gamma\rangle \neq 0$
is $\alpha = 0$. Thus, $R \neq 0$ for all $\alpha > 0$, and has the sign of $\langle\gamma\rangle$ unless it has discontinuous changes of
sign. Notice that, given the asymptotic behaviours just mentioned, if $R$ is discontinuous it
can only have an even number of changes of sign. A similar result has already been obtained
in a broader frame [30]. From the behaviour of $R$ in the small $\alpha$ limit, it can be seen that
the problem gets very difficult to learn for rules with $\langle\gamma\rangle$ close to 0. In fact, in the very
special case of $\langle\gamma\rangle = 0$, $R = 0$ is a solution of the saddle point equations for every value of
$\alpha$. If this is the only solution, the machine cannot learn at all, as is the case for the reversed
wedge rule when $\delta = \delta_c$. This behaviour is similar to that of retarded learning, found
in problems of unsupervised learning with quadratic cost functions [30]. In that case, it
has been shown that learning is still possible, provided that the cost function is capable of
extracting the information about the anisotropy of the distribution of stabilities contained in
its higher order moments [32]. Notice that this is not the case for the cost functions for the
SMCs considered in this paper.
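The vanishing of the average stability at $\delta = \delta_c$ can be verified directly. The sketch below assumes that the average stability of a perceptron aligned with the teacher is $\langle\gamma\rangle = \int Dz\, z\, \mathrm{sign}(V(z))$, which is our reading of the definition and not a formula copied from the text. For the reversed wedge this integral evaluates to $\sqrt{2/\pi}\,(2e^{-\delta^2/2} - 1)$. The function names and the integration grid are illustrative.

```python
import math


def mean_stability_reversed_wedge(delta):
    """Closed form of the assumed <gamma> for V(z) = z (z - delta) (z + delta)."""
    return math.sqrt(2.0 / math.pi) * (2.0 * math.exp(-0.5 * delta * delta) - 1.0)


def mean_stability_numeric(V, zmax=10.0, n=200000):
    """Trapezoidal estimate of the integral of Dz z sign(V(z)) for any rule V."""
    h = 2.0 * zmax / n
    total = 0.0
    for i in range(n + 1):
        z = -zmax + i * h
        v = V(z)
        sign = (v > 0) - (v < 0)  # +1, -1, or 0 exactly at a root
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * z * sign * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


delta_c = math.sqrt(2.0 * math.log(2.0))
```

The closed form is positive for $\delta < \delta_c$, negative for $\delta > \delta_c$, and reduces to $\pm\sqrt{2/\pi}$ in the two linearly separable limits $\delta \to 0$ and $\delta \to \infty$, in line with the sign of $R$ discussed in section III D.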

VI. CONCLUSION 

The properties of the recently proposed Support Vector Machines have been studied
theoretically in two situations of interest, namely for the cases where the student either has
the same structure as the teacher or is more complex than the teacher. In both situations the rule to
be learned is realizable, and interesting properties of hard margin SVMs, like the existence of
hierarchical generalization, could be analyzed within the replica symmetry hypothesis [16].

In the present paper we addressed the situation where the task is more complex than
the learning machine. In this case the cost function for the SVMs is modified, which yields
a Soft Margin Classifier that results from a trade-off, controlled by a single parameter
$C$, between increasing the margin and minimizing the number of training errors. As the
cost function is quadratic and the domain of solutions is convex, we obtain the typical
learning curves for a variety of unrealizable tasks using the replica symmetry hypothesis.
We considered problems characterized by a single symmetry-breaking direction $\mathbf{w}_0$, along
which the patterns have alternating positive or negative class label. We have shown that
the convergence of the corresponding learning curves to the asymptotic value follows either
a power law or an exponential, depending on the position of the singularities of the teacher's
rule.
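The trade-off controlled by $C$ can be made concrete with a small sketch of a soft margin cost. It assumes the standard form, a regularization term $\frac{1}{2}\mathbf{w}\cdot\mathbf{w}$ plus $C$ times the sum of the slacks raised to the power $k$, with slacks measured from a unit margin. The normalization of the cost function (5) used in this paper may differ, and all names below are illustrative.

```python
def smc_cost(w, patterns, labels, C, k=1):
    """Soft margin cost: 0.5 * |w|^2 + C * sum_i slack_i ** k (assumed form)."""
    reg = 0.5 * sum(wi * wi for wi in w)
    err = 0.0
    for x, tau in zip(patterns, labels):
        margin = tau * sum(wi * xi for wi, xi in zip(w, x))
        slack = max(0.0, 1.0 - margin)  # zero for patterns beyond the unit margin
        err += slack ** k
    return reg + C * err


# Three one-sided examples with margins 2, 0.5 and -1 for w = (1, 0)
w_example = [1.0, 0.0]
patterns_example = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
labels_example = [1, 1, 1]
cost_k1 = smc_cost(w_example, patterns_example, labels_example, C=2.0, k=1)
cost_k2 = smc_cost(w_example, patterns_example, labels_example, C=2.0, k=2)
```

With $k = 2$ the misclassified pattern (slack 2) dominates the error term far more than with $k = 1$, which is the mechanism invoked in section III to explain why distant, unavoidable errors are penalized more heavily for $k = 2$.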

Even if the student is well adapted to the task's complexity, the SMC may generalize 
better than the error-free hard margin SVM, provided the hyperparameter C in the cost 
function is correctly tuned. It can even attain almost Bayesian performance. 

We showed that the prefactors of the different asymptotic behaviours are proportional
to the average stability of the teacher's rule, $\langle\gamma\rangle$. When this vanishes, the SMC with cost
function (5) cannot learn, and the overlap between the student and the teacher directions
is $R = 0$. We considered two exponents for the error term in the cost function, $k = 1$ and
$k = 2$. It would be interesting to study the properties of SMCs trained using exponents
$k > 2$ in the cost function, as we expect that these should detect the difference of the odd
moments of the patterns' distribution in the directions parallel and orthogonal to $\mathbf{w}_0$.

Another interesting question is whether the hierarchical learning of hard margin SVMs
exists also with SMCs. To tackle this question, pattern distributions with two different
anisotropies have to be considered.
VIII. APPENDIX



The saddle point equations for the cases $k = 1$ and $k = 2$ are:




(31) 
(32) 
(33) 



with, for the case $k = 1$,
r^^^ 1 [■'^ txC

h{xC,Q,R-l)= / f Dtt{t+^)g{R,t,P)+ L ^ Dt ^g{R,t, P), 

■^Tq W 

KTrOR-D-f^^Df^a^ 1 .2 dg{R,t,P) , 2-xC dg{R,t,P) 

HxC, Q, R, 1) - / Dt- {t + + y + 9^ : 

V Q V Q 

h{xC,Q,R;l)= / f Dt(i + -=)V^,t,P)+ ^Dt^--Lg{R,t,P) 



and, for the case k = 2, 



h{xC, Q, R; 2) = + to) ^' ^) (37) 



1 + 2a;C ^ VQ^ 



oo 



,,i.C,Q,R;2)=J Dt^it^^f'-i^ (38) 

V Q 



roo (2xCV 1 

73(xC, Q, R; 2) = (1^2^ + 7^^'^^^' ^' ^^^^ 



From (39) it can be seen that, for $k = 2$, $x$ must vanish in the infinite $\alpha$ limit in order
to make $I_3$ vanish. Notice that the function $g(R,t,P)$ is always nonnegative (19). For the
case $k = 1$ the analysis of (36) shows that $x$ must either vanish or tend to a positive constant
with $Q$ tending to infinity. This last case can be ruled out by noticing that it is inconsistent
with the vanishing of $I_2$ (notice that (35), as well as (38), can be solved analytically).

To show that $R$ can only tend to 1 or $-1$ in the infinite $\alpha$ limit, it is useful to rewrite $I_1$
and $I_2$, which in the case $k = 1$ are

xC-l 

h{xC,Q,R;l) = l_f Dtg{R,t,P) + Rl2{xC,Q,R;2) (40) 



h{xC, Q, R- 1) = Y,r{xt f-j={ fj^ Dt {tVT^ + ^ + x,R) 
xC f°2 



+ -7^ rMz!i + "^^)> (^^) 

and for k = 2, 



2xC r°° 

h{xC, g, R- 2) = ^^9{R, t, P) + Rh{xC, Q, R; 2) (42) 

+ + Xii?) + {Xi ^ -Xi)}. (43) 

Let us suppose that $R$ tends to a constant different from 1 and $-1$ as $\alpha$ tends to infinity.
It can be seen that in that case $I_1$, $I_2$ and $I_3$ must vanish at the same rate. If we consider
teachers with at least one positive root, i.e. unrealizable teachers, it can be seen that the
integral in $I_3$ (the second one for the case $k = 1$) never vanishes. Thus, $I_3$ must vanish as
$(x/\sqrt{Q})^2$ for $k = 1$ and as $x^2$ for $k = 2$, if $Q$ tends to a constant or to infinity. But equations
(40) and (42) show that $I_1$ and $I_2$ cannot vanish at the same rate as $I_3$ because the first
term on the right hand side vanishes as $x/\sqrt{Q}$ for $k = 1$ and as $x$ for $k = 2$. If $Q$ tends to 0
then $I_3$ must vanish as $(x/\sqrt{Q})^2$ for both cases. But then $I_2$ cannot vanish at the same rate,
because equations (41) and (43) show that $I_2$ must vanish as $xC\langle\gamma\rangle/\sqrt{Q}$, unless $\langle\gamma\rangle = 0$
(this case will be analyzed below). Therefore, $R$ tends either to 1 or to $-1$ for all teachers
with $\langle\gamma\rangle \neq 0$.

By putting $R = 0$ in the equations one can easily see (notice that $g(0, t, P) = 1$) that if
$\langle\gamma\rangle \neq 0$, it can only be a solution for $\alpha = 0$. On the other hand, for $\langle\gamma\rangle = 0$, $R = 0$ is a
solution for every value of $\alpha$, i.e. learning is impossible for this kind of teacher.

It is also possible to find the condition that makes $R$ go to each one of its limiting values
(1 or $-1$). From what has been said before regarding $I_3$ it can be seen that it vanishes as
$x^2$, and so, $1 - R^2 \sim \alpha x^2$. Using this and equation (31), it is evident that $I_1$ must vanish
faster than $x$. But, in the infinite $\alpha$ limit, $I_1$ is written, to first order,

-1 

h{xC,Q,R;l) ^ ^{stgn{R){^) + Dtt g{±l,t,P)} (44) 



h{xC, g, 7?; 2) - -x{sign{R)^ - Dtg{±l, t, P)} 



00 

00 



1 . e-'-"' 



sign{R) J2 ^ r{xt){\x,\-^)-^} (45) 



Thus, the term within brackets must vanish. For (44) it is evident that this can only
happen if $R \to \mathrm{sign}(\langle\gamma\rangle)$. The same can be shown for (45), with a bit of algebra. The
asymptotic value of $Q$ can be obtained by imposing the vanishing of the above mentioned
terms.

To see the rate of decay of $1 - R^2$, notice that, from (33) and from the fact (shown above)
that $I_3 \sim x^2$, one gets that $1 - R^2 \sim \alpha x^2$. But, using the fact that $I_1$ must decay faster
than $x$, equations (41) and (43) impose that $I_2 \sim x$. This, together with (32), gives that
$x \sim 1/\alpha$. Therefore, $1 - R^2 \sim 1/\alpha$.
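The scaling argument of this Appendix, combined with the convergence laws quoted in section V, can be collected in a single chain. This is only a summary of statements made in the text, with prefactors omitted; $c$ and $c'$ denote unspecified positive constants and $\epsilon_g^{\infty}$ the asymptotic generalization error.

```latex
\begin{align*}
I_3 \sim x^2 \ \text{with (33)} \quad &\Longrightarrow \quad 1 - R^2 \sim \alpha\, x^2, \\
I_2 \sim x \ \text{with (32)} \quad &\Longrightarrow \quad x \sim 1/\alpha, \\
&\Longrightarrow \quad 1 - R^2 \sim \alpha \cdot \alpha^{-2} = 1/\alpha, \\[4pt]
V(0) = 0: \quad \epsilon_g - \epsilon_g^{\infty} \sim \sqrt{1 - R^2} \quad &\Longrightarrow \quad \epsilon_g - \epsilon_g^{\infty} \sim \alpha^{-1/2}, \\
V(0) \neq 0: \quad \epsilon_g - \epsilon_g^{\infty} \sim \exp\!\left(-\frac{c}{1 - R^2}\right) \quad &\Longrightarrow \quad \epsilon_g - \epsilon_g^{\infty} \sim \exp(-c'\alpha).
\end{align*}
```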
FIG. 1. Linearly separable rule. SMC's learning curves ($\epsilon_t$ below, $\epsilon_g$ above) corresponding to
an exponent $k = 1$ in the cost function, for different values of the hyperparameter $C$. The
generalization errors of the MSP and the optimal (Bayesian) generalizer are included for comparison.
The learning curves of the optimal SMC, discussed in section IV, are also represented. Symbols,
$\epsilon_t$ in black, $\epsilon_g$ in white, correspond to results of computer simulations with $N = 100$. Error bars
are smaller than the symbols.

FIG. 2. Linearly separable rule. Same as the preceding figure, with an exponent $k = 2$ in the
cost function.

FIG. 3. Shifted linear rule. SMC's learning curves corresponding to an exponent $k = 1$ in
the cost function, for different values of the hyperparameter $C$. Symbols correspond to results of
computer simulations with $N = 50$. Error bars are smaller than the symbols. The figure on the
right shows the difference between the MSP within the replica symmetry approximation and the
SMC learning with $C = \infty$. Asymptotically, $\epsilon_g^{\infty} = 0.1179$.

FIG. 4. Shifted linear rule. Same as the preceding figure, with an exponent $k = 2$ in the cost
function. Simulation results correspond to $N = 100$.

FIG. 5. Sandwich rule. SMC's learning curves corresponding to an exponent $k = 1$ in the cost
function, for different values of the hyperparameter $C$. Asymptotically, $\epsilon_g^{\infty} = 0.023$.




FIG. 6. Sandwich rule. SMC's learning curves corresponding to an exponent $k = 2$ in the cost
function, for different values of the hyperparameter $C$. Symbols correspond to results of computer
simulations with $N = 100$. Asymptotically, $\epsilon_g^{\infty} = 0.023$.

Sandwich rule, $\delta = 2$

FIG. 9. Reversed wedge rule with $\delta = 0.3$. SMC's learning curves corresponding to an exponent
$k = 1$ in the cost function. The optimal value of the generalization error is $\epsilon_g^{opt} = 0.178$, but the
SMC converges asymptotically to $\epsilon_g^{\infty} = 0.235$. Simulation results correspond to $N = 100$.

FIG. 10. Reversed wedge rule with $\delta = 2$. Learning curves obtained with different values of the
hyperparameter $C$, with $k = 1$ in the cost function. Asymptotically, $\epsilon_g^{\infty} = 0.0455$.

FIG. 12. Optimal values of the hyperparameter $C_{opt}$ for unrealizable rules.